# Global-Scale FPGA-Accelerated Deep Learning Inference with Microsoft's Project Brainwave

Gabriel Weisz, Bing Engineering, Microsoft





### Over 1 Million Catapult FPGAs in Our Data Centers





### Machine Learning

### **Accelerated Networking**



### Catapult FPGA Servers

Microsoft



0.5m QSFP cable from NIC to FPGA



~3m QSFP cable from FPGA to TOR

[Slide courtesy Andrew Putnam]

### Catapult in the Data Center





### Catapult + Software = Hardware Microservices



Traditional software (CPU) server plane

- Interconnected FPGAs form a separate plane of computation
- FPGAs are used and managed independently from the CPU
- Applications are mapped across multiple FPGAs and CPUs



### Hardware Microservices for Real-Time AI

Real-Time AI = low latency without batching

Brainwave maps neural network models to multiple network-attached FPGAs

Weights are pinned to registers for low latency





### Mapping a Model Across FPGAs



### Bing Intelligent Search Powered By Brainwave



Bing launches new intelligent search features, powered by AI

Today we announced new Intelligent Search features for Bing, powered by AI, to give you answers faster, give you more comprehensive and complete information, and enable you to interact more naturally with your search engine.

#### Intelligent answers:

Intelligent answers leverage the latest state of the art machine reading comprehension, backed by Project Brainwave running on Intel's FPGAs, to read and analyze billions of documents to understand the web and help you more quickly and confidently get the answers you need.

Bing now uses deep neural networks to validate answers by aggregating across multiple reputable sources, rather than just one, so you can feel more confident about the answer you're getting.

[Screenshot: Bing search results page showing the answer "1928", labeled "Consolidated from multiple sources". The visible answer text begins "In 1928, the Women's College was ..." and continues "... University" in honor of Pembroke College at the University of Cambridge in England."]

### FPGA-Accelerated model is **much** faster even though it is more complicated

|                                               | CPU-only                 | Brainwave-accelerated     |
|-----------------------------------------------|--------------------------|---------------------------|
| Model details                                 | GRU 128x200 (x2) + W2Vec | LSTM 500x200 (x8) + W2Vec |
| End-to-end latency per Batch 1 request at 95% | 9 ms                     | 0.850 ms                  |

Improvement: the Brainwave-accelerated model is > 10X larger and has > 10X lower latency.

|                                               | CPU-only                      | Brainwave-accelerated             |
|-----------------------------------------------|-------------------------------|-----------------------------------|
| Model details                                 | 1D CNN + W2Vec (RNNs removed) | 1D CNN + W2Vec + GRU 500x500 (x4) |
| End-to-end latency per Batch 1 request at 95% | 15 ms                         | 5 ms                              |

Improvement: the Brainwave-accelerated model is > 10X larger and has 3X lower latency.

#### CPU vs Stratix V performance on production models



### Brainwave Components

FPGA-based overlay ("NPU")

- Highly parameterized
- Supports multiple FPGA device generations
- Run-time programmable

Enterprise-grade software stack

- FPGA management
- Orchestration of computations
- Model compiler




### This talk focuses on the overlay

### Deep Learning Network Topologies







- Recurrent Networks
- Convolutional Networks
- Transformer Networks [Vaswani+, "Attention is all You Need", arXiv]

What computations do we need to support?


### Example RNN: Long Short-Term Memory (LSTM)






## (Almost) Everything is a Matrix Operation



How should we compute matrix operations?


### Primitives for Matrix Operations



**Brainwave uses Matrix-Vector Multiply** 
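To make the primitive concrete, here is a minimal Python sketch of a matrix-vector multiply, and of a batched matrix-matrix multiply rewritten as repeated matrix-vector multiplies. The function names `mv_mul` and `mm_mul_as_mvms` and the plain-list data layout are illustrative assumptions; the sketch says nothing about how the hardware implements the operation.

```python
def mv_mul(matrix, vector):
    """Multiply a matrix (a list of rows) by a vector: one dot product per row."""
    if any(len(row) != len(vector) for row in matrix):
        raise ValueError("every matrix row must match the vector length")
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def mm_mul_as_mvms(matrix, batch):
    """Express a matrix-matrix multiply as one matrix-vector multiply per input."""
    return [mv_mul(matrix, vector) for vector in batch]


# Hypothetical 3x2 weight matrix and a single unbatched request.
W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
single_request = [1.0, -1.0]
print(mv_mul(W, single_request))  # prints [-1.0, -1.0, -1.0]
```

With a batch of one, a dense layer is a single `mv_mul` call, so an unbatched request maps directly onto this primitive.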



### What Else Has to Run on the FPGA?



- Recurrent Networks
- Convolutional Networks
- Transformer Networks

Neural networks are not just matrix multiply


### Brainwave Overlay Design Principles

Objectives

- Fast inferencing without batching
- Simple programmability with a single thread of control

Balance NPU complexity, instruction granularity, and flexibility

- All instructions operate on vectors of some native dimension
- Compute model is MVM and vector operations
- Neural networks decomposed into these operations

**Instruction Chaining** 

- Optimizes for matrix operation followed by vector operations
- Reduces the need for dependency analysis and multi-ported register files
- Allows a compact instruction encoding
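Instruction chaining can be pictured as a vector flowing through a fixed sequence of operations, with no intermediate result written to a named register between steps. The Python sketch below is a software analogy under assumed names (`run_chain`, `mvm`, `add`, `sigmoid`) and an assumed `NATIVE_DIM` of 4; it is not the overlay's instruction set.

```python
import math

NATIVE_DIM = 4  # hypothetical native vector length, not a Brainwave parameter


def mvm(matrix):
    """Return a chain step that multiplies the flowing vector by `matrix`."""
    return lambda v: [sum(w * x for w, x in zip(row, v)) for row in matrix]


def add(operand):
    """Return a chain step that adds `operand` elementwise."""
    return lambda v: [a + b for a, b in zip(v, operand)]


def sigmoid():
    """Return a chain step that applies the logistic function elementwise."""
    return lambda v: [1.0 / (1.0 + math.exp(-a)) for a in v]


def run_chain(source, steps):
    """Feed `source` through each step in order; only the final vector is kept."""
    vector = source
    for step in steps:
        vector = step(vector)
        if len(vector) != NATIVE_DIM:
            raise ValueError("every step must produce a native-sized vector")
    return vector


# One matrix operation followed by two vector operations, as a single chain.
identity = [[1.0 if r == c else 0.0 for c in range(NATIVE_DIM)]
            for r in range(NATIVE_DIM)]
bias = [0.0, 1.0, -1.0, 2.0]
result = run_chain([0.0, -1.0, 1.0, -2.0], [mvm(identity), add(bias), sigmoid()])
print(result)  # prints [0.5, 0.5, 0.5, 0.5]
```

Because each step consumes only the output of the previous one, the only operands that need names are the ones supplied to each step, which is one way to read the slide's point about compact encoding.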





### Brainwave Overlay Microarchitecture








### Brainwave Data Management



## **Overlay Specialization**



### Optimizing for Different MVM Operations

Recurrent network with large matrices



Convolutional network with many small filter operations

Parallelize across patches





## **Overlay Specialization**



## Brainwave Firmware Programs

Firmware is a C program that runs on the control processor and makes the accelerator execute each particular neural network

Firmware manages control flow and data movements

Firmware maps the network's operations to chains of operations that the accelerator supports





### DNN Operators and Brainwave

### **Operations common in Deep Learning Networks**

- LSTM
- GRU
- Convolution
- SoftMax
- Bias
- Scale
- Max Pool
- Batch Norm
- Sigmoid
- TanH

### **Operations supported by the Brainwave accelerator**

- MVM
- Vector add/sub/max
- Hadamard product
- Sigmoid
- TanH
- Square root
- Inverse
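As one possible illustration of how higher-level operators reduce to the supported ones, the sketch below expresses bias, scale, max pooling, and inference-time batch normalization using only elementwise add/sub/max, Hadamard product, square root, and inverse. The Python helper names and signatures are hypothetical (`vv_add` and `vv_mul` borrow their names from the firmware listing but are ordinary functions here), and the batch normalization formula is the standard textbook one, not something taken from the slides.

```python
import math


def vv_add(a, b):
    return [p + q for p, q in zip(a, b)]


def vv_sub(a, b):
    return [p - q for p, q in zip(a, b)]


def vv_max(a, b):
    return [max(p, q) for p, q in zip(a, b)]


def vv_mul(a, b):
    """Hadamard (elementwise) product."""
    return [p * q for p, q in zip(a, b)]


def v_sqrt(a):
    return [math.sqrt(p) for p in a]


def v_inv(a):
    return [1.0 / p for p in a]


def bias(x, b):
    return vv_add(x, b)


def scale(x, s):
    return vv_mul(x, s)


def max_pool(vectors):
    """Reduce several vectors to their elementwise maximum."""
    result = vectors[0]
    for v in vectors[1:]:
        result = vv_max(result, v)
    return result


def batch_norm(x, mean, var, gamma, beta, eps=1e-5):
    """gamma * (x - mean) / sqrt(var + eps) + beta, built from the primitives."""
    inv_std = v_inv(v_sqrt(vv_add(var, [eps] * len(var))))
    return bias(scale(vv_mul(vv_sub(x, mean), inv_std), gamma), beta)


print(batch_norm([3.0], [1.0], [4.0], [2.0], [0.5], eps=0.0))  # prints [2.5]
print(max_pool([[1.0, 5.0], [3.0, 2.0]]))                      # prints [3.0, 5.0]
```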



### LSTM Sketch in 29 Lines of Firmware Code

<pre>
void LSTM(int steps) {
  // Loop over sequence
  for (int t = 0; t &lt; steps; t++) {
    // Read input
    v_rd(InputQ);
    v_wr(InitialVrf, xt);

    // Process current input
    v_rd(InitialVrf, xt);
    mv_mul(Wf);
    vv_add(bf);
    v_wr(AddSubVrf, xWf);

    // Process hidden state
    v_rd(InitialVrf, h_prev);
    mv_mul(Uf);
    vv_add(xWf);
    v_sigm();
    vv_mul(c_prev);
    v_wr(AddSubVrf, ft_mod);

    // Process hidden state
    v_rd(InitialVrf, h_prev);
    mv_mul(Uc);
    vv_add(xWc);
    v_tanh();
    vv_mul(it);
    vv_add(ft_mod);
    v_wr(MultiplyVrf, c_prev);
    v_wr(InitialVrf, ct);

    // Compute next hidden state and output
    v_rd(InitialVrf, ct);
    v_tanh();
    vv_mul(ot);
    v_wr(InitialVrf, h_prev);
    v_wr(OutputQ);
  }
}
</pre>

Firmware includes instruction chains that direct each functional unit
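Reading each `v_rd` ... `v_wr` run in the listing as one chain, the arithmetic of a single time step can be written out in ordinary Python. This is a sketch of the math only, under stated assumptions: the helper functions are plain Python stand-ins rather than the firmware API, and `xWc`, `it`, and `ot` are taken as inputs because the listing uses them without showing how they are produced.

```python
import math


def mv_mul(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def vv_add(a, b):
    return [p + q for p, q in zip(a, b)]


def vv_mul(a, b):
    return [p * q for p, q in zip(a, b)]


def v_sigm(a):
    return [1.0 / (1.0 + math.exp(-p)) for p in a]


def v_tanh(a):
    return [math.tanh(p) for p in a]


def lstm_step(xt, h_prev, c_prev, Wf, bf, Uf, Uc, xWc, it, ot):
    """One time step, with one statement per chain in the firmware sketch."""
    # Chain 1: xWf = Wf * xt + bf
    xWf = vv_add(mv_mul(Wf, xt), bf)
    # Chain 2: ft_mod = sigmoid(Uf * h_prev + xWf) (.) c_prev
    ft_mod = vv_mul(v_sigm(vv_add(mv_mul(Uf, h_prev), xWf)), c_prev)
    # Chain 3: ct = tanh(Uc * h_prev + xWc) (.) it + ft_mod
    ct = vv_add(vv_mul(v_tanh(vv_add(mv_mul(Uc, h_prev), xWc)), it), ft_mod)
    # Chain 4: ht = tanh(ct) (.) ot, which becomes both h_prev and the output
    ht = vv_mul(v_tanh(ct), ot)
    return ht, ct


# Tiny 1-element example with all-zero weights so the result is easy to check.
zero = [[0.0]]
ht, ct = lstm_step(xt=[1.0], h_prev=[0.0], c_prev=[2.0],
                   Wf=zero, bf=[0.0], Uf=zero, Uc=zero,
                   xWc=[0.0], it=[1.0], ot=[1.0])
print(ct)  # prints [1.0]: sigmoid(0) * 2.0 + tanh(0) * 1.0
```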

### Mapping LSTM Chains to the Accelerator



## **Convolutional Networks**



- Convolutions:
  - Convolutions slide a window over the image
  - The set of input data at each location is called a "patch"
  - The convolution computes a dot product between each patch and a set of filters.
  - The output of the convolution operation is a 2D array of vectors each containing one element per filter
- Batch normalization reduces the range of the activation values, reducing covariate shift
- Pooling operations reduce the size of the feature maps
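The convolution described above can be sketched in a few lines of Python: extract a patch at each window position, then take its dot product with every filter. The sketch assumes a single-channel image, stride 1, and no padding, and the names `extract_patches` and `convolve` are illustrative only.

```python
def extract_patches(image, k):
    """Slide a k x k window (stride 1, no padding) and flatten each patch."""
    height, width = len(image), len(image[0])
    return [
        [[image[r + dr][c + dc] for dr in range(k) for dc in range(k)]
         for c in range(width - k + 1)]
        for r in range(height - k + 1)
    ]


def convolve(image, filters, k):
    """Return a 2D array of vectors with one element per filter at each location."""
    return [
        [[sum(p * f for p, f in zip(patch, flt)) for flt in filters]
         for patch in row]
        for row in extract_patches(image, k)
    ]


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
filters = [
    [1, 1, 1, 1],  # sums the whole 2x2 patch
    [1, 0, 0, 0],  # picks the top-left element of the patch
]
print(convolve(image, filters, 2))
# prints [[[12, 1], [16, 2]], [[24, 4], [28, 5]]]
```

Stacking the filters as the rows of a matrix turns the inner loop into one matrix-vector multiply per patch.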



### ResNet-152: A Convolutional Neural Network

Input: 224 x 224 image → 50k input vectors of 3 elements

Intermediate feature maps range from 112x112 vectors of depth 64 to 7x7 vectors of depth 2048

Won the 2015 ILSVRC challenge and achieved human-level accuracy

151 convolutional layers

60 million model parameters

11 billion FLOPs

Output: 1000 floats, each corresponding to a score for that category

ResNet-152 is procedurally generated using blocks of network layers that repeat
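The layer counts follow from the repeating block structure. The sketch below assumes the standard bottleneck configuration from the original ResNet paper (per-stage block counts of 3, 8, 36, 3 for ResNet-152 and 3, 4, 6, 3 for ResNet-50, with three convolutions per block, and shortcut projections not counted); these numbers are not stated on the slides, and the function names are illustrative.

```python
# Bottleneck blocks per stage, per the original ResNet paper (an assumption here).
BLOCKS_PER_STAGE = {
    "ResNet-50": [3, 4, 6, 3],
    "ResNet-152": [3, 8, 36, 3],
}
CONVS_PER_BOTTLENECK = 3  # 1x1, 3x3, 1x1


def conv_layer_count(variant):
    """Count the stem convolution plus three convolutions per bottleneck block."""
    return 1 + CONVS_PER_BOTTLENECK * sum(BLOCKS_PER_STAGE[variant])


def input_vector_count(height, width):
    """One input vector (of 3 color elements) per pixel."""
    return height * width


print(conv_layer_count("ResNet-152"))  # prints 151
print(conv_layer_count("ResNet-50"))   # prints 49
print(input_vector_count(224, 224))    # prints 50176, i.e. about 50k
```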



### "Res" = "Residual" Learning with Shortcuts





### Specializing Brainwave for ResNet-152



## Model Porting and Firmware Generation



### Mapping ResNet-152 to Brainwave






### ResNet-152 and ResNet-50 Performance

- All convolution layers run on the FPGA
- Experiments use a batch size of 1
- Classifier runs on host computer
- Results are for the layers running on the FPGA and include data transfers
- Results on Arria 10 GX 1150 running at 300 MHz

| ResNet variant         | ResNet-152 | ResNet-50 |
|------------------------|------------|-----------|
| Convolution Layers     | 151        | 49        |
| Inference Latency (ms) | 4          | 1.65      |
| Top-1 Accuracy (%)     | 75.4       | 73.3      |
| Reference Top-1 (%)*   | 77         | 75.3      |
| Top-5 Accuracy (%)     | 92.4       | 91.1      |
| Reference Top-5 (%)*   | 93.3       | 92.2      |

\* [github.com/KaimingHe/deep-residual-networks]

ResNet-152 reference results:

- [Ma+ ISCAS 17]: 72 ms on the Arria 10 GX 1150
- [Aziz+ HPCA 2019]: 35 ms on the Virtex-7 485T

ResNet-50 reference results:

- [Chen+ FPGA 2019]: 8 ms on VU9P

Our experiments: 25% faster than an NVIDIA P40
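For a rough sense of scale, the latency ratios implied by the numbers above can be computed directly. The figures are copied from this slide; the references use different devices and possibly different settings, so the ratios are only indicative. The names `BRAINWAVE_MS`, `REFERENCE_MS`, and `latency_ratios` are illustrative.

```python
# Latencies in milliseconds, as reported on the slide.
BRAINWAVE_MS = {"ResNet-152": 4.0, "ResNet-50": 1.65}
REFERENCE_MS = [
    ("ResNet-152", "Ma+ ISCAS 17", 72.0),
    ("ResNet-152", "Aziz+ HPCA 2019", 35.0),
    ("ResNet-50", "Chen+ FPGA 2019", 8.0),
]


def latency_ratios():
    """Map (model, source) to reference latency divided by Brainwave latency."""
    return {
        (model, source): reference / BRAINWAVE_MS[model]
        for model, source, reference in REFERENCE_MS
    }


for (model, source), ratio in latency_ratios().items():
    print(f"{model} vs {source}: {ratio:.2f}x")
# prints:
# ResNet-152 vs Ma+ ISCAS 17: 18.00x
# ResNet-152 vs Aziz+ HPCA 2019: 8.75x
# ResNet-50 vs Chen+ FPGA 2019: 4.85x
```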



### FPGA-Accelerated CNNs in Azure

### 5 well-known convolutional neural networks

- ResNet-152
- ResNet-50
- DenseNet-121
- VGG-16
- SSD-VGG

The system includes an SDK, web-based GUI, and tutorials [https://aka.ms/aml-real-time-ai]



### Azure-Hosted ResNet-50 Based Land Classification



Created a national land cover map in about 10 minutes using \$42 of compute time

[https://blogs.microsoft.com/green/2018/05/23/achievement-unlocked-nearly-200-million-images-into-a-national-land-cover-map-in-about-10-minutes/]



### Azure-Hosted ResNet-50 for Particle Physics

### FPGA-accelerated machine learning inference as a service for particle physics computing

Javier Duarte · Philip Harris · Scott Hauck · Burt Holzman · Shih-Chieh Hsu · Sergo Jindariani · Suffian Khan · Benjamin Kreis · Brian Lee · Mia Liu · Vladimir Lončar · Jennifer Ngadiuba · Kevin Pedro · Brandon Perez · Maurizio Pierini · Dylan Rankin · Nhan Tran · Matthew Trahms · Aristeidis Tsaris · Colin Versteeg · Ted W. Way · Dustin Werran · Zhenbin Wu

**Abstract** Large-scale particle physics experiments face challenging demands for high-throughput computing resources both now and in the future. New heterogeneous computing paradigms on dedicated hardware with increased parallelization, such as Field Programmable Gate Arrays (FPGAs), offer exciting solutions with large potential gains. The growing applications of machine learning algorithms in particle physics [...]

J.D., B.H., S.J., B.K., M.L., K.P., N.T., and A.T. are supported by Fermi Research Alliance, LLC under Contract No. DE-AC02-07CH11359 with the U.S. Department of Energy.

[https://arxiv.org/pdf/1904.08986.pdf] (arXiv, Apr 2019)

### FPGA-Accelerated CNNs in Azure

5 well-known convolutional neural networks

- ResNet-152
- ResNet-50
- DenseNet-121
- VGG-16
- SSD-VGG (this one localizes objects in the image)

The system includes an SDK, web-based GUI, and tutorials [https://aka.ms/aml-real-time-ai]





### SSD-VGG For Empty Shelf Detection at the Edge




#### Kroger and Microsoft Partner to Redefine the Customer Experience and Introduce Digital Solutions for the Retail Industry

- America's largest grocery retailer and global technology company partnering to pilot two connected experience stores
- Companies will jointly bring to market Retail as a Service product for retailers and present the solution at NRF 2019: Retail's Big Show

Company Release - 1/7/2019 6:30 AM ET

CINCINNATI and REDMOND, Wash., Jan. 7, 2019 /PRNewswire/ -- The Kroger Co. (NYSE: KR) and Microsoft Corp. (Nasdaq: MSFT) today announced a collaboration to redefine the customer experience using Kroger Technology products powered by Microsoft Azure, the retailer's preferred cloud platform for Retail as a Service (RaaS). Through this innovative partnership, Kroger will pilot a connected store experience and together with Microsoft, jointly market a commercial RaaS product to the industry.

[http://ir.kroger.com/file/Index?KeyFile=396285733]



### Research Topics in DNN Inference Acceleration

- Number format
- Sparse Networks
- Dynamic Networks
- Overlay Sharing Between Models



### Takeaways

FPGAs are great for neural networks because we can specialize the overlay for the network and update the overlay in place

- Can switch any FPGA to a different configuration for load balancing
- Neural networks keep changing and FPGAs allow us to keep up

Brainwave is co-designed across hardware and software to take advantage of this flexibility to perform neural network inference at a massive scale for 1st-party models on Bing and 3rd-party models on Azure

Brainwave is still under development and we're scaling it to better FPGA hardware and bigger models





- https://www.microsoft.com/en-us/research/project/project-brainwave/
- https://www.microsoft.com/en-us/research/project/project-catapult/
- https://aka.ms/aml-real-time-ai

